# Use VLLM as backend

Lighteval allows you to use `vllm` as backend allowing great speedups.
To use, simply change the `model_args` to reflect the arguments you want to pass to vllm.

```bash
lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
    "leaderboard|truthfulqa:mc|0|0"
```

`vllm` is able to distribute the model across multiple GPUs using data
parallelism, pipeline parallelism or tensor parallelism.
You can choose the parallelism method by setting in the the `model_args`.

For example if you have 4 GPUs you can split it across using `tensor_parallelism`:

```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tensor_parallel_size=4" \
    "leaderboard|truthfulqa:mc|0|0"
```

Or, if your model fits on a single GPU, you can use `data_parallelism` to speed up the evaluation:

```bash
lighteval vllm \
    "pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,data_parallel_size=4" \
    "leaderboard|truthfulqa:mc|0|0"
```

## Use a config file

For more advanced configurations, you can use a config file for the model.
An example of a config file is shown below and can be found at `examples/model_configs/vllm_model_config.yaml`.

```bash
lighteval vllm \
    "examples/model_configs/vllm_model_config.yaml" \
    "leaderboard|truthfulqa:mc|0|0"
```

```yaml
model: # Model specific parameters
  base_params:
    model_args: "pretrained=HuggingFaceTB/SmolLM-1.7B,revision=main,dtype=bfloat16" # Model args that you would pass in the command line
  generation: # Generation specific parameters
    temperature: 0.3
    repetition_penalty: 1.0
    frequency_penalty: 0.0
    presence_penalty: 0.0
    seed: 42
    top_k: 0
    min_p: 0.0
    top_p: 0.9
```

> [!WARNING]
> In the case of OOM issues, you might need to reduce the context size of the
> model as well as reduce the `gpu_memory_utilization` parameter.


## Dynamically changing the metric configuration

For special kinds of metrics like `Pass@K` or LiveCodeBench's `codegen` metric, you may need to pass specific values like the number of
generations. This can be done in the `yaml` file in the following way:

```yaml
model: # Model specific parameters
  base_params:
    model_args: "pretrained=HuggingFaceTB/SmolLM-1.7B,revision=main,dtype=bfloat16" # Model args that you would pass in the command line
  generation: # Generation specific parameters
    temperature: 0.3
    repetition_penalty: 1.0
    frequency_penalty: 0.0
    presence_penalty: 0.0
    seed: 42
    top_k: 0
    min_p: 0.0
    top_p: 0.9
  metric_options: # Optional metric arguments
    codegen_pass@1:16:
      num_samples: 16
```

An optional key `metric_options` can be passed in the yaml file,
using the name of the metric or metrics, as defined in the `Metric.metric_name`.
In this case, the `codegen_pass@1:16` metric defined in our tasks will have the `num_samples` updated to 16,
independently of  the number defined by default.
